ABSTRACT

Item bias, when present in a multiple-choice test, can be detected by
appropriate analyses of the persons x items scoring matrix. Five related
schemes for the statistical analysis of bias were applied to a widely used
primary skills multiple-choice test which was administered in either its
English- or Spanish-language version, at each of two levels, to 1259
students in bilingual education programs. The results indicate that from
one-fifth to one-third of the items in the tests show strong evidence of
bias, corroborated by a separate analysis of linguistic and cultural sources
of bias for both the biased items and those items with no statistical
findings of bias.
Introduction

A systematic but unanticipated pattern of responses to a multiple-choice
test found for an entire group of test-takers is generally regarded as
evidence of bias. This interpretation results from indications of one or
more differences between groups on levels of knowledge and skill, or in
linguistic and cultural issues related to the use of language in the test.
However, the behaviors of individual respondents have important consequences
for that interpretation. Whether the respondent unerringly picks the correct
response, or successfully engages in elimination of incorrect answers, or
guesses well, the observer scores the item "correct" and concludes that the
student "knows" the required skills or material. The inference that the
respondent "does not know" is made whether he/she guesses incorrectly,
eliminates wrong choices badly, or chooses an attractive but incorrect
alternative.

Most likely, phenomena looking like systematic patterns of bias in test
items are the results of complex interactions of these group and individual
factors with one another and with certain properties of the test items.
What is required to make sense of the issue of bias is analysis of patterns
found in these combinations of performance. The multiplicity of possible
patterns suggests that the detection and interpretation of bias must be
conducted along several routes.

Goals of This Research 

The first of two purposes of this paper is to investigate analyses of
the persons x items scoring matrix of a test for the detection of item bias.
The persons x items scoring matrix contains a significant amount of
information about the patterns of responses generated by a set of examinees.
Using a few geometrical and statistical considerations, the patterns of
responses from separate groups of examinees tested with the same instrument
can be compared. If these patterns show that the test is not measuring the
same thing (skills, competence, thinking abilities) in comparable groups,
if the groups are responding to different aspects of the test items, or if
cultural and/or linguistic issues take precedence, it may be that the test
is biased.

The second purpose of this paper is to study empirically the question
of bias as shown by these several techniques in the context of a widely used
achievement test, the Comprehensive Tests of Basic Skills (CTBS), which has
been translated from English into Spanish. The claims made about this
instrument include the statement that the Spanish-language version represents
a close replicate of the English-language version, with careful attention
having been exercised in removing all forms of unintended bias. The primary
task of this analysis is to ascertain the degree of comparability of the two
versions of the CTBS in the assessment of similar groups of children, and to
see if any bias remains.

Related Literature 

A substantial research literature has developed around the term "item
bias" in the search for a single best all-purpose indicator which always
reveals bias whenever systematic discrepancies in performance between groups
are found. A large number of methods have been proposed and a large number
of studies conducted (cf. reviews in Berk, in press; Subkoviak, Mack, &
Ironson, 1981). Certain tests such as the Wechsler Intelligence Scale for
Children have been extensively investigated (cf. Sandoval, 1979). The range
of applications of the term "bias" is quite broad: studies have examined
sociocultural bias and the stereotyping of items and answers, cultural
differences, and linguistic variations (cf. Jensen, 1980); construct bias
and the different aspects of performance tapped in different examinee groups
by the same test (cf. Ebel, 1975); and contextual bias and the misuse of tests
with specific groups (cf. Williams, 1971). Occasionally the word is even
used to mean a conscious preference on the part of the examinee (Hudson, 1963).

Increasingly complex techniques have been set forth for the detection
of bias in items. Methods have been based on analysis of variance,
transformed item difficulties, factor techniques, adjusted chi square
procedures, distractor analyses, "adverse impact" and item characteristic
curves (Merz, 1980; Petersen, 1980; Rudner, Getson, & Knight, 1980). Many of
these methods are statistically complex but, with the exception of the last,
statistically inelegant (Hunter, 1975); unfortunately the most elegant
solution, item characteristic curve analysis, requires large numbers of items
and respondents for its computation. Few of these approaches offer convincing
or useful explanations of why some items are biased and others are not
(Crowder, 1979). Faced with the multiplicity of both the forms of item bias
and the statistical methods that have been put forward to detect such bias,
one logical place to begin is to inquire about the nature of a test which is
absolutely free of bias.

An Unbiased Test 

If a test could be created which fulfilled all of the requirements of
a bias-free instrument, its items would all measure the same trait or ability
and be equally reliable and equally valid for all groups (Petersen, 1980).
It would also show orderly variation in the relative difficulties of the
items, and be responded to in an orderly manner by every individual. One
example of the outcome of this improbable creature is the familiar perfect
Guttman scale, in which persons are perfectly ordered by increments of skill
level, and items within the test are perfectly ordered by increments of
difficulty. No higher-level item is mastered by any respondent until each
lower-level item is mastered; guessing also plays no role. The sequence of
successes and failures is highly deterministic.

Figure 1A represents a ten-item test with right/wrong scores for ten
respondents. These ten persons never successfully answered a more difficult
item without first having succeeded on a less difficult item. An axis of
performance can be drawn on the diagonal to separate all correct scores from
all incorrect scores. While the total p-value for the test is lower for
another group of ten persons tested on the same ten items, shown in Figure 1B,
the performance patterns are parallel. Other than a main effect due to
groups, nowhere in either diagram is there any indication of a systematic
unexpected difference in the pattern of responses or bias in the test.
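
The defining property of the perfect scale can be checked mechanically. The
following is a minimal sketch, not part of the original analyses; the function
name and the list-of-lists layout (one row per person, one 0/1 entry per item)
are assumptions made for illustration.

```python
def is_guttman_scale(matrix):
    """Return True if a 0/1 persons x items matrix is a perfect Guttman scale.

    Assumed layout: matrix[i][j] is 1 when person i answered item j correctly.
    """
    n_items = len(matrix[0])
    # Order items from easiest to hardest by the number of correct responses.
    item_totals = [sum(row[j] for row in matrix) for j in range(n_items)]
    order = sorted(range(n_items), key=lambda j: -item_totals[j])
    for row in matrix:
        ordered = [row[j] for j in order]
        score = sum(ordered)
        # In a perfect scale, a person with score k passes exactly the
        # k easiest items and fails all of the remaining ones.
        if ordered != [1] * score + [0] * (n_items - score):
            return False
    return True
```

Any single success on a harder item without success on every easier item is
enough for the check to fail, which is the sense in which Figures 1A and 1B
are idealized.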


A Slightly Biased Test 

A somewhat less artificial example of test results from a multiple-
choice test is shown in Figure 2A; the score matrix of a hypothetical ten-item
test has been sorted by both persons, on ascending total score, and by items,
on ascending level of difficulty.

Insert Figures 2A and 2B about here

Neither persons nor items is perfectly ordered in the sense used above, and
guessing of correct answers probably contributes by an unknown amount to the
scores obtained. Not one but two dividing lines are now required to separate
the patterns of performance in this figure. The first line, a cumulative
ogive representing student performance, is drawn on the matrix based on the
total correct score for every respondent. The second, representing problem
difficulty, is drawn as a cumulative ogive based on item p-values. Note that
for a test which demonstrates exclusively random responding, the theoretical
position of the student curve (S-curve) would be vertical, and of the problem
curve (P-curve), horizontal.
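
The double sorting and the two curves can be sketched in a few lines. This is
an illustrative sketch only: the function name is hypothetical, and it assumes
the layout in which the highest-scoring person is placed in the top row and
the easiest item in the leftmost column, which is the orientation implied
later when low scorers are described as sitting at the bottom of the chart.

```python
def sp_chart(matrix):
    """Doubly sort a 0/1 persons x items matrix and locate the S and P curves.

    Assumed layout: matrix[i][j] is 1 when person i answered item j correctly.
    """
    n_persons, n_items = len(matrix), len(matrix[0])
    person_scores = [sum(row) for row in matrix]
    item_totals = [sum(row[j] for row in matrix) for j in range(n_items)]
    # Rows: highest total score first. Columns: easiest item first.
    person_order = sorted(range(n_persons), key=lambda i: -person_scores[i])
    item_order = sorted(range(n_items), key=lambda j: -item_totals[j])
    sorted_matrix = [[matrix[i][j] for j in item_order] for i in person_order]
    # S-curve: in each row, the boundary falls after as many cells (counted
    # from the left) as that person's total score.
    s_curve = [person_scores[i] for i in person_order]
    # P-curve: in each column, the boundary falls after as many cells
    # (counted from the top) as the number of persons passing that item.
    p_curve = [item_totals[j] for j in item_order]
    return sorted_matrix, s_curve, p_curve
```

For a perfect Guttman scale the two step functions coincide; the more the
sorted matrix departs from that pattern, the further apart they lie.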

At this juncture we introduce a second set of data obtained from the
same hypothetical test. The "respondents" were slightly less capable on most
items but all other considerations were held equal. A score matrix for the
same set of items as shown in Figure 2A but now with the second group of
examinees is shown in Figure 2B. The relative order of items is somewhat
changed because of differing levels of difficulty; the second group performs
less well overall than the first group. Statistical differences between the
data in Figures 2A and 2B should reflect overall item and group differences,
but because of the idealized symmetry between the two, there is little
likelihood that a statistical indicator of bias would prove significant. An
initial analysis of these figures recommended by Jensen (1980) is a two-factor
(groups x items) nested analysis of variance. The interpretation of a
significant groups effect, in the absence of other significant factors, is that
the groups behave symmetrically with respect to ordering of item difficulties
but that one group is consistently more capable across the trait being
appraised by this test. A significant difference on both the groups and items
factors, plus a significant interaction between groups and items, together
suggest that the test items and examinee abilities in the two groups are
heterogeneous.¹ However, these findings would be quite insufficient to say
that the test is biased (Hunter, 1975) and, additionally, do not account for
the contribution of guessing.

A second approach recommended by Jensen (1980) for understanding the
differences between the two figures uses the phi coefficient, which is the
correlation obtained between the group response to a given item and the same
group's response to any other item in the test. Phi is a measure of joint
contingency; Jensen explains its use for analysis of bias:

    Only if the two items have the same difficulty can phi be
    equal to 1. To determine the intrinsic correlation (of the
    items) free of the influences in item difficulty, we must
    divide the obtained phi by the maximum value of phi that
    could possibly be obtained with the given marginal
    frequencies (p. 431).

The ratio of phi to the maximum value of phi is summed over all possible pairs
of items for each group, and then the ratios are compared. The null
hypothesis for this comparison is that the difference between the obtained sums
is not different from randomness, and thus there is no systematic discrepancy
in group performance. In the artificial situation shown by the Guttman
scale for both groups in Figure 1, this test is necessarily nonsignificant.
For data which do not fit the mandates of a perfect scale, the obtained
value for the comparison of ratio sums increases as the discrepancy in overall
patterns of response by the two separate groups widens. While the amount of
difference between groups is given by the analysis of variance and phi, the
nature of patterns of response to items is not adequately explained.

¹ The comparison of Figures 2A and 2B yields only a significant difference on
the factor of items (F(9,162) = 13.98, p<.001).

² For the difference between Figures 2A and 2B, χ² = 8.0222, p<.01.
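
For a single pair of items the ratio can be computed directly from the two
columns of the scoring matrix. The sketch below is illustrative; the function
name is hypothetical, and it uses the textbook upper bound for a positive phi
given fixed marginals, which we assume is the quantity Jensen intends.

```python
import math

def phi_over_phimax(x, y):
    """Ratio of phi to its maximum for two 0/1 item-score vectors x and y.

    Returns None when either item has no variance (everyone passes or
    everyone fails), since phi is undefined in that case.
    """
    n = len(x)
    px, py = sum(x) / n, sum(y) / n
    pxy = sum(a * b for a, b in zip(x, y)) / n  # proportion passing both
    denom = math.sqrt(px * (1 - px) * py * (1 - py))
    if denom == 0:
        return None
    phi = (pxy - px * py) / denom
    # Largest positive phi attainable with these marginals: it is reached
    # when everyone who passes the harder item also passes the easier one.
    hi, lo = max(px, py), min(px, py)
    phi_max = math.sqrt(lo * (1 - hi) / (hi * (1 - lo)))
    return phi / phi_max
```

A pair of items that is consistent with a Guttman ordering yields a ratio of
1 regardless of the difference in difficulty, which is why the comparison is
necessarily nonsignificant for Figure 1.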

Only a small number of statistically based analyses specifically designed
to study patterns of responding to multiple-choice tests have been proposed.
Tatsuoka (1981) and Harnisch and Linn (1981) have been working on a norm
conformity index and other parameters which address each individual's
performance in the context of patterns obtained by all members of the group.
Sato (1980) defines an index of disparity between actual and ideal response
patterns which can be applied to individuals or to items. To unravel the
problem of patterns, we now turn to Sato's system of analysis of the persons
x items matrix.

The S-P Method and Analysis of the Persons x Items Matrix

The key element in Sato's (1980) S-P method of analysis of test
performance is the doubly-ordered persons x items matrix, with student curve
(S-curve) and problem curve (P-curve) drawn in. In Japan, this procedure is
widely used in classrooms to obtain the characteristic performance of the set
of examinees, which may be compared visually to several "standard" curve
functions for diagnostic purposes.³

³ Direct [illegible], and the amount of
discrepancy between the S and P curves is relatively easy to accomplish;
the same holds for item analysis, individual performance analysis, and
other summary statistics within a group. In Japan, this system has been
automated using a microcomputer (Sato, Takeya, Kurata, Morimoto &
Chimura, 1981).

Sato has developed an index of discrepancy to evaluate the degree to
which the S and P curves do not conform either to one another or to the
Guttman scale. Except in the case of the perfectly ordered sets shown in
Figure 1, there is always some degree of discrepancy between curves. The
index is explained as follows:

    $$D^* = \frac{A(N, n, \bar{p})}{A_B(N, n, \bar{p})}$$

    where the numerator ... is the area between the S curve and the P
    curve in the given S-P chart for a group of $N$ students who took
    $n$-problem test and got an average problem-passing rate $\bar{p}$,
    and $A_B(N, n, \bar{p})$ is the area between the two curves as
    modeled by cumulative binomial distributions with parameters $N$,
    $n$, and $\bar{p}$, respectively (Sato, 1980, p. 15).

The denominator is a function which expresses a truly random pattern of
responses for a test with a given number of subjects, given number of items,
and given average passing rate, while the numerator reflects the obtained
pattern for that test. As the value of this ratio approaches 1.0, it portrays
an increasingly random pattern of responses. For the perfect Guttman scale
as represented by Figure 1, the numerator will be 0 and thus D* will be 0.⁴

Indices of discrepancy, when computed for each of two groups of examinees,
may not be statistically compared because of differences in ranking of item
difficulty, and/or compound differences in response patterns to several
items. However, as long as the two D* values obtained are not equivalent,
it is an indication that somewhere within the matrices are one or more items
which are behaving dissimilarly across groups.

⁴ In Figure 2A, D* = .2534; in Figure 2B, D* = .3747.

Analysis of Respondents Above P Curve

Patterns of discrepant performance result from a mixture of random
behaviors and wrong choices, except for those items which are so easy that
no respondent gets them wrong. Aside from the tautology that respondents
with less ability are less likely to answer a given item correctly, all
other things being equal they are also likely to use chance responding.
Analysis of those respondents who are unlikely to be answering randomly would
seem a likely means to understanding patterns and bias in items. To begin
constructing a simple analytic solution to this problem, suppose we take a
single uncomplicated item from the S-P chart, and examine the pattern of
responses for only that portion of the same group of examinees for whom the
prediction of success is relatively high, i.e., those above the P-curve.
These are the examinees who tended to score better overall. Specifically,
respondents at the very top of this select subgroup are expected to have had
a finite but small probability of having guessed their way to success.
Respondents at the bottom of this select subgroup would have a finitely larger
probability, while those at the very bottom of the entire S-P chart would be
likely to have a more random pattern.

If the selected item, however, is one for which no individual within the
sample, no matter how skilled, is able to answer knowledgeably, the response
pattern among the select group of putative "masters" should be random, and
should not differ from the response pattern of those examinees not included in
this subgroup. For a four-choice item of this kind, the item's p-value should
be about .25 and the select subgroup of putative "masters" would be correct
only 25% of the time. Figure 3 illustrates a pattern of responses for a
nearly random item, in contrast with an item which is fairly well-fitted to the
skills of a set of respondents.
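
One way to make the selection of "masters" concrete is sketched below. It
rests on an assumption about the chart that the text does not spell out: with
persons ranked from highest to lowest total score, the P-curve for an item is
taken to fall after as many rows as there are persons passing that item, so
the "masters" are the top-ranked persons down to that row. The function name
is hypothetical.

```python
def masters_success_rate(matrix, item):
    """Proportion of putative "masters" who answered the given item correctly.

    Assumed layout: matrix[i][j] is 1 when person i answered item j correctly.
    Returns None when nobody passed the item, so that no one lies above
    the P-curve.
    """
    # Rank persons by total score, highest first (ties keep input order).
    order = sorted(range(len(matrix)), key=lambda i: -sum(matrix[i]))
    # Height of the P-curve in this column: the number passing the item.
    n_pass = sum(row[item] for row in matrix)
    masters = order[:n_pass]
    if not masters:
        return None
    return sum(matrix[i][item] for i in masters) / len(masters)
```

Under this reading, an item that fits the ability ordering gives a rate near
1.0, while a four-choice item answered at random gives a rate near .25 among
the "masters" as well as in the group as a whole.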

The proportions of "masters" who are indeed correct can be compared
between groups. With relatively uniform variances, the test of significant
difference in independent proportions applied to this problem yields a z
score; a significant z score would be an indication of possible bias separate
from the difference in average passing rates for that item, if any. A
comparison of nonuniform variances requires transforming the item difficulties
into standard score form, then testing the size of the difference following
Rudner, Getson and Knight (1980). Within certain limits, an item which is
relatively easy for one group and relatively difficult for another may show
no bias in the proportions of "masters" who are correct because those
individuals who place above the P curve all have the ability to answer that item
correctly. However, on another item one of two groups may not be academically
equipped, or may be prevented from responding by biases in the test, curriculum,
or culture; thus the proportions may differ, possibly by an amount sufficiently
large to be deemed significant.
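
For the uniform-variance case, the comparison reduces to the familiar
two-sample z test for proportions. The sketch below assumes the pooled
standard error version of that test; the function and argument names are
illustrative, and the standard-score transformation used for nonuniform
variances is not shown.

```python
import math

def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """z score for the difference between two independent proportions.

    correct_a / n_a and correct_b / n_b are the counts of correct "masters"
    and the numbers of "masters" in the two groups.
    """
    p_a, p_b = correct_a / n_a, correct_b / n_b
    # Pooled proportion under the null hypothesis of no group difference.
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # both groups entirely correct or entirely incorrect
    return (p_a - p_b) / se
```

With 80 of 100 "masters" correct in one group and 60 of 100 in the other, the
function gives a z of about 3.09, which would exceed the 1.96 criterion.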

Analysis of Distractors 

One further analysis of the potentially biased item is to examine the
patterns of wrong answers made by the separate groups of respondents. Within
the multiple-choice test format, differences between groups in the
attractiveness of incorrect responses signal that the item's wrong choices may
be differentially distracting. When a given item has attractive but incorrect
responses for one group, Goodman and Kruskal's lambda indicates whether another
group shares the same proportional pattern of selecting those incorrect
responses (Veale & Forman, 1976). Lambda is an index of predictive
association, which shows "...how one is led to predict differentially in light
of the relationship..." (Hayes, 1963, p. 610, italics original). It is
calculated for a problem involving two groups by evaluating the largest
discrepancy between rates of responding to similar wrong choices, where
max f_jk is the larger frequency of the two groups for any single wrong
choice, and max f_.k is the larger marginal frequency of the two groups summed
across all wrong choices.

If Goodman and Kruskal's lambda is appreciably above zero, the
interpretation can be made that the pattern of distraction is different for the
two groups. If the index is zero, even though the difficulty of the item and/or
the proportions who select a wrong option may differ between the two groups, the
pattern of selecting the wrong answers is about the same.
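
A small computational sketch may help fix the interpretation. It assumes the
standard asymmetric form of lambda in which the wrong choice is predicted from
group membership, since that form is zero whenever both groups share the same
most popular wrong answer; the function name and the table layout are
illustrative assumptions.

```python
def goodman_kruskal_lambda(table):
    """Lambda for predicting the wrong choice from group membership.

    Assumed layout: table[g][c] is the number of respondents in group g
    who selected wrong option c.
    """
    n = sum(sum(row) for row in table)
    n_choices = len(table[0])
    # Best single guess when group membership is ignored.
    choice_totals = [sum(row[c] for row in table) for c in range(n_choices)]
    best_overall = max(choice_totals)
    # Best guess when the most popular wrong choice of each group is used.
    best_within = sum(max(row) for row in table)
    if n == best_overall:
        return 0.0  # every wrong answer fell on one option
    return (best_within - best_overall) / (n - best_overall)
```

For example, wrong-answer counts of 30, 10, 10 in one group and 60, 25, 15 in
the other give a lambda of 0, while counts of 30, 10, 10 and 10, 30, 10 give a
lambda of one-third.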

Another check on the relative attractiveness of a wrong answer can be
made by counting the number of wrong answers which are chosen at least 10%
more often than the next most popular wrong answers. These particular wrong
choices constitute a class of "popular distractors," each of which can be
studied further. The easiest comparison is between those items for which
both groups picked the same popular distractor and those items for which the
two groups picked different popular distractors. Note that in this latter
case, the computation of lambda will always yield a nonzero value.
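
The 10% criterion can be sketched as follows. The text does not say whether
the margin is taken in percentage points of the group or relative to the
runner-up; this sketch assumes percentage points of all respondents in the
group, and the function name and the dictionary layout are illustrative.

```python
def popular_distractor(wrong_counts, n_respondents, margin=0.10):
    """Return the label of the popular distractor for one item, or None.

    wrong_counts maps each wrong option label to the number of respondents
    in the group who chose it. An option qualifies when its share of the
    group exceeds that of the next most popular wrong option by at least
    `margin` (assumed here to mean percentage points of the whole group).
    """
    ranked = sorted(wrong_counts.items(), key=lambda kv: -kv[1])
    top_label, top_count = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    if (top_count - runner_up) / n_respondents >= margin:
        return top_label
    return None
```

Applying the function to each language group in turn and comparing the two
returned labels separates items with the same popular distractor from items
with different ones.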

A series of analyses of item bias has been described, with special
attention paid to those comparisons premised on the persons x items scoring
matrix, doubly sorted. The following sections describe the execution of these
analyses in the context of a multi-language achievement test.
Instruments

For a study of the possible bias inherent in a multi-language test,
two levels of the Comprehensive Tests of Basic Skills (CTBS) published by
CTB/McGraw Hill (1974, 1978) were administered in this study. Students in
grades 2 and 3 were given the CTBS Level C; participating fifth and sixth
grade students took Level 2. CTBS-English Level C is designed for students
in grades 1.6 to 2.9; CTBS-Spanish Level C is designed for students in grade
2. CTBS-English Level 2 has a target population in grades 4.5 to 6.9; the
Spanish translation was designed for students in grades 5 and 6.

The CTBS-English and CTBS-Spanish tests were selected for several
reasons. Test content is roughly parallel. The CTBS-Spanish was the first
test at CTB/McGraw Hill to be subjected to a four-step editorial procedure
designed to reduce test bias; included were studies of content validity,
application of editorial guidelines in item construction, reviews for bias,
and separate ethnic group pilot studies with the test. In the translation of
the CTBS from English to Spanish, the test developers tried to keep the test
content and measurement features intact. This, of course, meant that in
some cases word-for-word translations were not possible. Nevertheless, the
publisher's intent was to provide tests that are similar in rationale and in
the process/content classification scheme. Thus, both the English- and
Spanish-language versions used in this study purport to measure the following
objectives:

1. the ability to recognize or recall information

2. the ability to translate or convert concepts from one kind of
language (verbal or symbolic) to another

3. the ability to comprehend concepts and their interrelationships

4. the ability to apply techniques, including performing operations

5. the ability to extend interpretation beyond stated information
(CTBS, 1974/1978)

Test length, test time, and administration procedures are exactly the 
same for English and Spanish versions of each test level. 

Subjects 

Five school districts in the state of California participated in the 
study. The total number of pupils tested was 1259, representing 81 intact
classrooms.

Classrooms were selected to represent a wide range of program options. 
The criterion for selection of school districts was that they had bilingual-
bicultural education programs funded by Title VII. Potential participants
were identified from schools listed in the California State Department of
Education 1979 Bilingual Program Directory. From this list, invitations
were sent to schools which had at least two classes at the same grade level
(grades one, two, five or six) having bilingual programs. Additionally,
instruction had to be delivered in self-contained, multisubject settings;
departmentalized or pull-out programs were excluded.

Analyses 

Five statistics explained above were used to evaluate the data for every
item separately. Each uses a minimum threshold value, above which the result
is taken as an indication of possible bias in the item. The analyses and their
minimums can be summarized as follows:

a) Test of proportions of correct scores: across groups, a difference
between transformed p-values which generates a z > 1.96;

b) Test of proportions of correct scores for "masters": across groups,
a difference between proportions of those respondents above the
P-curve who make errors, which generates a z > 1.96;

c) Test of chance responding by "masters": within each group, a difference
between the obtained proportion of those passing the item and a
theoretical p-value of .25, which generates a z < 1.96;

d) Test of differential attractiveness of wrong answers: a Goodman
and Kruskal's lambda computed on the proportions of incorrect answers
by choice within item, such that λ > 0.0;

e) Test of popular distractors: a wrong choice for an item attracting
at least 10% more responses than the next most popular wrong
choice for that item.

Results

The number of items within each subtest by level, and the number of
students in each of two language groups who were included, are shown at the
top of Table 1. Item p-values indicate that items ranged from moderately easy
to very difficult for both language groups, with an overall mean of somewhat
over half of the items correct.

Insert Table 1 about here

While in a few items the Spanish-language group did better, without exception
the Spanish-language groups always scored lower overall on the subtests. In
every instance the maximum p-values achieved by the English-language groups
are slightly higher than the comparable scores for the Spanish-language
groups. Table 1 also shows, for the corresponding number of students, the
p-value needed for a significant (p<.05) difference from chance responding to
an item. This figure is obtained by reversing the usual computation for the
test of independent proportions, using z = 1.96 and p_chance = .25. For all
but one of the subtests, both language groups had one or more items which
appear to represent random choice of the correct answer. Except for the
Passage Comprehension subtest at Level C, the Spanish-language group appears
to make random selections more often than the English-language group, an
assumption which is further explored below.
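
The reversed computation can be illustrated with a short sketch. It assumes
the one-sample normal approximation, in which the observed proportion is
compared with the fixed chance value; the function name and defaults are
illustrative, and the figures in Table 1 may have been obtained with a
slightly different formula.

```python
import math

def chance_threshold(n_students, p_chance=0.25, z=1.96):
    """Smallest item p-value that differs significantly from chance.

    Solves z = (p - p_chance) / sqrt(p_chance * (1 - p_chance) / n) for p.
    """
    standard_error = math.sqrt(p_chance * (1 - p_chance) / n_students)
    return p_chance + z * standard_error
```

With 100 students, for instance, an item p-value would need to be at least
about .335 to be distinguished from random choice on a four-option item.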

For purposes of illustration, two analyses recommended by Jensen (1980)
were conducted on the subtest with the smallest number of items, Level C
Passage Comprehension. The two-factor nested analysis of variance for this
subtest shows a significant effect due to the groups factor (F(1,650) = 54.91,
MSerror = 1.37) and a significant effect due to the interaction between
items and groups (F(17,11050) = 2.61, MSerror = 0.43). The ratio of phi to
phi-max is higher for the English-language sample than for the Spanish-
language sample (English mean φ/φ-max = .8207; Spanish mean φ/φ-max = .766,
t(151) = 4.01, p<.01). This brief set of findings indicates only that the
language groups are not performing the same way as one another on the subtest.
It seems that the Spanish-language sample may have had more difficulty with
some items than did their English-language counterparts. No further detail
can be learned from these analyses, and they are not used in the study of
the remaining subtests.

The S-P charts were drafted for each subtest by language group for a
total of eight complete charts. The index of discrepancy D* is presented in
the last row of Table 1. The fact that the D* values are higher for the
Spanish-language groups suggests that they engaged in patterns closer to
chance responding more often than did English-language groups. While the
differences between pairs of D* values are large for the Passage Comprehension
subtest at both Level C and Level 2, these values cannot be compared further.
The specific reasons why the Spanish-language versions generate larger D*
values can only be made evident with further analyses.

Results from the set of five analyses which together provide sufficient
evidence of patterns of discrepant performance are presented below and in
Table 2. The table shows percentages of items for each of the four subtests
in this study which exceed a critical minimum on each of the five analyses.

Test of proportions of correct scores. The first of the concise set
of analyses is the test of proportions, which is applicable to percentages
of correct answers expressed in standard score form, for both groups on each
item of each subtest. The first two rows of Table 2 show the percent of
items favoring the English- or Spanish-language groups. Six out of every
ten items in the Vocabulary subtests show significant differences between
groups; in a majority of instances the higher group is the English-
language group. Half of the items in the Passage Comprehension subtest at
Level C show a significant difference and over three-quarters of the items in
that subtest at Level 2 show a significant difference; in no instance are the
Spanish-language groups ahead of their English-language counterparts.

Test of proportions of correct scores for "masters". Both the second
and third analyses in this set are based on the selective sample of "masters",
those students whose overall scoring position places them above the P-curve
for each item. By evaluating the proportions of correct scores for those
members of the language groups, a list of statistically significant
discrepancies between "masters" is generated. The third and fourth rows of
Table 2 show the percent of items within subtest for which the success rate
among "masters" is significantly higher for the English-language or Spanish-
language groups. The Passage Comprehension subtests at both levels appear
to have different rates at which the "masters" are able to avoid the wrong
answer; in the majority of instances the rate is higher for the English-
language groups. In the Passage Comprehension subtests, the rate is uniformly
higher for the English-language groups.

Test of chance responding by "masters". How often the samples of
"masters" are not able to choose the correct response at a rate better than
chance forms a third part of the analysis. The fifth and sixth rows of Table 2
show that for the Level C subtests, no items are found for which either group
responded randomly. However, for Level 2, a small number of items in both
subtests elicited chance responding by "masters". These items appear to be
so difficult that not even the better students could knowledgeably select the
correct response. The Spanish-language group has a much larger number of
chance responses among "masters" than the English-language groups on the
Level 2 Passage Comprehension subtest.

Test of differential attractiveness of wrong answers. The fourth
analysis in this sequence is the analysis of differential patterns of
incorrect responses. Goodman and Kruskal's lambda was calculated for each
item, using a 2 x 3 table of groups by incorrect response rates. Values
ranged from 0.0 to .23, with a median of 0. Lambda will be 0 for any 2 x 3
table of proportions for which both groups are attracted to the same
response, even if the actual dimensions of those attractions differ drastically.
As there is no exact test of significance, any nonzero lambda was considered
to be an indicator of possible bias. The seventh row of Table 2 shows the
percentage of items within each subtest for which a nonzero lambda was found.
The ratio of such items to the number of items within subtest ranges from
[illegible] to 1:2, suggesting that, when wrong answers were selected, the two
language groups often behaved very differently.

Test of popular distractors. The concluding analysis in this series
asks whether there are any incorrect choices which were sufficiently
attractive to be classed as popular distractors. In the final rows of
Table 2 are shown the percentage of items which meet the
[illegible]-or-greater criterion for the English-language groups, the
Spanish-language groups, and jointly across groups. Except in Passage
Comprehension at Level 2, the Spanish-language group's results show more
items with popular distractors than the English-language group. Percent
joint overlap is of particular interest, since that value gives another
indication of the uniformity of behaviors across language groups when
selecting incorrect responses. In the subtests in this study, the joint
overlap of popular distractors is very small, suggesting again that many
items of the English version of the test and the Spanish translation may
not be as comparable as the test designers intended.
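
The tabulation of popular distractors and their joint overlap can be
sketched as follows. The numeric criterion is not legible in this section,
so `threshold` is a required argument and the value used in the example is
purely hypothetical. Treating "joint overlap" as the share of items on which
both groups have at least one popular distractor in common is likewise an
assumption, as are all names and counts.

```python
def popular_distractors(option_counts, correct, threshold):
    """Wrong options chosen by at least `threshold` of all respondents."""
    total = sum(option_counts.values())
    if total == 0:
        return set()
    return {
        option
        for option, count in option_counts.items()
        if option != correct and count / total >= threshold
    }


def joint_overlap_percent(items_a, items_b, threshold):
    """Percent of items on which both groups share a popular distractor.

    Each argument is a list of (option_counts, correct_option) pairs,
    one per item, in the same item order for both groups.
    """
    shared = 0
    for (counts_a, key_a), (counts_b, key_b) in zip(items_a, items_b):
        pop_a = popular_distractors(counts_a, key_a, threshold)
        pop_b = popular_distractors(counts_b, key_b, threshold)
        if pop_a & pop_b:
            shared += 1
    return 100.0 * shared / len(items_a)


# Hypothetical response counts for two items in each language version.
english = [({"A": 60, "B": 25, "C": 10, "D": 5}, "A"),
           ({"A": 25, "B": 60, "C": 10, "D": 5}, "B")]
spanish = [({"A": 50, "B": 5, "C": 40, "D": 5}, "A"),
           ({"A": 30, "B": 50, "C": 15, "D": 5}, "B")]

print(joint_overlap_percent(english, spanish, threshold=0.20))
```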

The degree of overlap between the five analyses in terms of the number
of positive findings for each subtest is shown in Table 3. The percentage of

Insert Table 3 about here 

items for which none of the preceding analyses show evidence of bias is
remarkably small. Level C Passage Comprehension, for example, has only a
single item which never shows a difference between the language groups. Over
half of the items in that subtest have at least two positive findings, and
four of the items have three positive findings. Table 3 shows that the
percentage of items for which three, four, or five out of five statistical
indicators yield positive results varies from about one-fifth to about
two-fifths of the items within each subtest.
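
The tally behind Table 3 amounts to counting, for each item, how many of
the five analyses returned a positive finding. A minimal sketch follows;
the item labels and flags are hypothetical, and only the three-or-more
cutoff comes from the text.

```python
from collections import Counter


def tally_indicators(flags_by_item):
    """Map each item to its number of positive indicators."""
    return {item: sum(bool(f) for f in flags)
            for item, flags in flags_by_item.items()}


def percent_by_count(counts, n_indicators=5):
    """Percent of items showing exactly k positive indicators, k = 0..n."""
    distribution = Counter(counts.values())
    n_items = len(counts)
    return {k: 100.0 * distribution.get(k, 0) / n_items
            for k in range(n_indicators + 1)}


def flagged_items(counts, minimum=3):
    """Items with at least `minimum` positive indicators."""
    return sorted(item for item, c in counts.items() if c >= minimum)


# Hypothetical flags, one per analysis (a) through (e), for four items.
flags = {
    "item 1": [True, False, False, False, False],
    "item 2": [True, True, False, True, False],
    "item 3": [False, False, False, False, False],
    "item 4": [True, True, True, True, True],
}
counts = tally_indicators(flags)
print(percent_by_count(counts))
print(flagged_items(counts))
```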


Content Analysis

On the basis of the preceding evidence from the statistical approach to
bias detection in the CTBS, those items which showed agreement of three or
more indicators were subjected to a careful analysis of item content. The
content analysis was a search for possible linguistic, curricular, and/or
cultural reasons which might explain differential performance between
language groups. This portion of the study was undertaken by an educational
researcher fluent in both English and Spanish, who made extensive reference
to the curricular materials used by the students in the sample, and
consulted with native speakers of various dialects in making an appraisal.
Five categories were tabulated as possible sources of influence which item
content might exert on the different language groups:

a) Mistranslation: the meaning and/or grammatical form of a key word
or phrase has been translated from the English original
in a manner which is an incorrect or inappropriate use of the
Spanish language;

b) Cultural bias: some key word or phrase within the item requires
familiarity with objects, behaviors, or values which are not
normally found in the Spanish and Latino cultures, or which may
have very different interpretations;

c) Linguistic bias: some key word or phrase within the item requires
familiarity with an idiomatic expression or verbal allusions which,
because of innate differences in language, do not translate well;

d) Low frequency word bias: some key word or phrase within the item
is not found, or rarely found, in the basal readers used for
instruction by the students in our sample;

e) Unfamiliar context bias: some key word or phrase within the item
appears in a context which is quite different from that found for
the word or phrase in the basal readers used for instruction.

An example of item content judged to bias respondents is shown by item
number 29 of the Level C Vocabulary subtest, an item for which all
statistical indicators point to possible trouble. Item 29 (rated as
category c, linguistic bias) requires the student to select a synonym for
"happy". Respondents to the English-language version of the test appear
significantly disadvantaged on this particular item. While the correct
option for this item in the Spanish-language version, /alegre/, was selected
60% of the time by our sample, the correct option in the English-language
version, /gay/, was selected by only 13% of the sample. The English-language
respondents instead split their selection equally between two of the
remaining options. Only one other item in the entire test set received as
strong a rejection, suggesting that among second and third graders the slang
English-language meaning for 'gay' has not only rendered it useless as a
synonym for 'happy' but has given it a strong pejorative flavor as well.

Table 4 shows data for items in each of the four subtests for which the
content analysis identified probable sources of bias. The entries in the table

Insert Table 4 about here 

represent tabulations of the content analysis categories for those items on
each subtest which have three or more statistical indicators. For the Level
C Vocabulary subtest, twelve items have at least three statistical
indicators; nine of those twelve show evidence of linguistic bias, and five
of the nine show evidence from an additional category of content bias as
well. Three of the four items from the Level C Passage Comprehension subtest
fit at least one of the categories of content bias, two of them with
multiple indicators. Only four out of nineteen items on the Level 2
Vocabulary subtest with three or more statistical indicators do not have
ostensible problems as shown by the content analysis procedure. Of
twenty-one items in the Level 2 Passage Comprehension subtest with three or
more indicators, only three cannot be corroborated by the analysis of
content. None of the items in any subtest which had no statistical
indicators of bias was found to have any content indicators of bias.

Table 5 presents a summary of subtest performance by group when those
items for which three or more statistical indicators turn up positive are
excluded. In three of the four subtests, the adjusted scores of the
Spanish-language

Insert Table 5 about here

groups move closer to their English-language counterparts. A
substantial difference remains, however, between scores for the Passage
Comprehension subtest at Level 2. The gain from initial to adjusted group
mean by the Spanish-language group is quite insufficient to raise that value
to the level of the English-language group. The adjusted minimum p-values
achieved by both groups move upward but the English-language group pulls
ahead noticeably.
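
The adjustment summarized in Table 5 can be sketched as a recomputation of
item p-values after the flagged items are dropped. The sketch assumes the
group mean is the mean of item p-values (proportions correct); the function
names and the tiny score matrix are hypothetical.

```python
def item_p_values(matrix):
    """Proportion correct per item for a persons x items 0/1 matrix."""
    n_persons = len(matrix)
    return [sum(column) / n_persons for column in zip(*matrix)]


def summary_excluding(matrix, flagged):
    """Summary of item p-values after removing flagged item indices."""
    p_values = item_p_values(matrix)
    kept = [p for j, p in enumerate(p_values) if j not in flagged]
    initial_mean = sum(p_values) / len(p_values)
    adjusted_mean = sum(kept) / len(kept)
    return {
        "n_items": len(kept),
        "adjusted_mean": adjusted_mean,
        "gain": adjusted_mean - initial_mean,
        "adjusted_min": min(kept),
        "adjusted_max": max(kept),
    }


# Hypothetical group of four persons on a four-item test;
# the last item (index 3) is treated as flagged by three or more indicators.
scores = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
]
result = summary_excluding(scores, flagged={3})
print(result)
```

Running the same function on each language group's matrix, with the same
flagged set, gives the paired columns that the table compares.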


DISCUSSION 


Five relatively simple analyses have been presented which point to five
related considerations in the search for bias. These are (a) overall group
differences and their direction, (b) differences in performance by a select
subsample of better respondents within groups, (c) differences from chance
responding by those subsamples, (d) differences between groups in the
selection of wrong answers, and (e) degree of distraction provided by wrong
item choices. The first of these follows the well-known Angoff delta
procedure (Angoff, 1972), without resorting to the arbitrary use of rescaling
which simply serves for added convenience. The second and third analyses
make use of the select subsample of putative "masters", those students within
each group whose overall performances place them above the P-curve; these
approaches are extensions of the work of Sato (1980) and colleagues. The
fourth and fifth procedures examine the bias question by studying those parts
of the multiple-choice item which are usually excluded from study in a
right-wrong scoring context (cf. Powell & Isbister, 1974).

For purposes of this paper, the five procedures are considered jointly,
with equal weights. Interpretations of bias are confirmed in the clear
majority of cases where the joint indication of three or more statistics is
found for an item. Certain problems remain to be solved, however, and
therefore some conditions must be placed on the use of this set of
approaches to the detection of item bias. It is clear, for example, that the
first index, because it is based on proportion of correct items, is to be
used with caution: "proportions of correct answers in a group of examinees
is not really a measure of item difficulty. This proportion describes not
only the test item but also the group tested" (Lord, 1980, p. 35). Indeed,
throughout it must be remembered that the results of this study are
descriptive of this sample only, and no external criteria are available to
evaluate comparability across language groups by grade.

A second objection is that the psychometric properties of the CTBS items
are only partially expressed by reliance on p-values and the S-P chart, which
at its core relies on the index of item difficulty. Thus, the conclusions
drawn from work with that chart are only as good as the strength of the item
difficulty metric. In addition, the S-P chart suffers from other metric
problems. The first is that the doubly-sorted persons x items matrix treats
data, in part, as interval rather than continuous data. Thus, for instance,
subtle gradations of difficulty may be given the same credence as larger
differences in the case where p-values are nonuniformly distributed.
Analogously, nonlinear distributions of total performance scores may
contribute in unknown ways to the use made of ranking information regarding
respondents: the patterns may not be as smooth as the chart makes them
appear. Moreover, as the S-P chart approaches randomness and its index of
discrepancy, D*, approaches 1.0, increasingly complex but hidden
interactions between the properties of the items in the test and the
attributes of the sample are likely. Thus, the second and third statistics
in the analytic set depend upon certain assumptions about the nature of
performance patterns, violations of which bear rather unclear consequences.
Related problems appear in item characteristic curve analysis (Linn, Levine,
Hastings, & Wardrop, 1980), and in the "adverse impact" approach (Merz, 1980).
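
The doubly-sorted persons x items matrix underlying the S-P chart can be
sketched in a few lines. Rows are ordered by descending total score and
columns by descending number correct (that is, ascending difficulty); the
S-curve is then read from the row totals and the P-curve from the column
totals. This is a minimal illustration of the sorting step only, with
hypothetical names and data, and it does not compute the discrepancy index.

```python
def sp_sort(matrix):
    """Return (sorted_matrix, person_order, item_order) for a 0/1 matrix.

    Ties keep their original relative order because sorted() is stable.
    """
    n_items = len(matrix[0])
    person_order = sorted(range(len(matrix)), key=lambda i: -sum(matrix[i]))
    item_totals = [sum(row[j] for row in matrix) for j in range(n_items)]
    item_order = sorted(range(n_items), key=lambda j: -item_totals[j])
    sorted_matrix = [[matrix[i][j] for j in item_order] for i in person_order]
    return sorted_matrix, person_order, item_order


def s_curve(sorted_matrix):
    """Number of items answered correctly by each person, top row first."""
    return [sum(row) for row in sorted_matrix]


def p_curve(sorted_matrix):
    """Number of persons answering each item correctly, easiest item first."""
    return [sum(column) for column in zip(*sorted_matrix)]


# Hypothetical three-person, three-item score matrix.
raw = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
chart, persons, items = sp_sort(raw)
print(chart)           # [[1, 1, 1], [1, 1, 0], [1, 0, 0]]
print(s_curve(chart))  # [3, 2, 1]
print(p_curve(chart))  # [3, 2, 1]
```

In this example the sorted matrix is a perfect Guttman pattern, so the two
curves coincide; real data depart from that pattern, and the objections
above concern how much weight such departures can bear.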

A third objection to the procedures used in this study centers on issues
of guessing. In the absence of an externally valid explicit criterion,
correction for guessing does not seem feasible (Choppin, 1974). Yet
assumptions about the occurrence and distribution of guessing affect all
aspects of the analysis, particularly statistics which address incorrect
responses. Volitional bias, quite likely contributing to the anomalous
response by the English-language group to item 29 on the Level C Vocabulary
subtest, is nowhere adequately considered. How much of a role guessing plays
is not well treated by the assumption that chance responding is represented
by p = .25. In the very likely event that some members of any group will
engage in guessing some of the time on some items, only the most general and
simplistic conclusions can be drawn from the data presented here. One
problem of particular note is the strong possibility that guessing assumes a
gradient distribution within the persons x items matrix. That is, from the
most capable to the least capable person, the contribution of guessing on
any item may move from relatively low probability to relatively high
probability, thus potentially interfering with diagnosis of problems
inherent in the item. But such diagnosis lies at the heart of the effort to
decipher and describe item bias. Until the gradient problem is separated
from the bias problem, only partially satisfactory conclusions can be drawn
about either.

On the positive side, the high level of match between content analysis
and the aggregate of statistical evidence suggests that this simple approach
to bias detection may have as much viability as more laborious and unwieldy
procedures. The ease of computations and interpretations, and the parsimony
of explanation are also favorable points (Merz, 1980). While some attempt
is made in the preceding pages to demonstrate the use of multiple indicators,
more possibilities can be pursued within this framework. The explanatory
power of the five-part procedure appears to exceed that offered by analysis
of variance or phi/phi-max, and the assumptions required about the
configuration of persons and items are fewer in number than those required
by the modified chi-square analyses which recently have been challenged as
inadequate ([illegible] and Slaughter, in press).

Comparison of the present set of results with those of more complex
analytic procedures conducted on the same data set awaits further study.
However, unlike the results reported by Linn, Levine, Hastings and Wardrop
(1981), in which item characteristic curve analyses for a hypothetical
data set "...did not lend themselves to making generalizations about
features of items..." (p. 38), the findings of the present study suggest at
least one concluding observation. Many signals point to a primary conclusion
that a number of items in the English-language and Spanish-language
versions of the CTBS do not seem to be comparable. Across a spectrum
of indicators, the Spanish-language groups regularly produced lower scores.
In three of four subtests, removing those items for which three or more
statistical indicators pointed to difficulty gave adjusted
scores which were very similar between groups. In the fourth subtest,
that correction did not yield significant improvement, suggesting that the
Spanish-language sample at grade 6 may be disadvantaged in some respect
unrelated to the CTBS itself.
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TABLE 2 



Percentage of Items Exceeding
Critical Minimums in Five Analyses

[Table body not recoverable. Columns: Subtest, grouped under Level C and
Level 2. Rows: Analysis, beginning with a) Test of proportions of correct
scores.]

TABLE 3

Percent of Items Showing Statistical 
Indicators of Differential Performance 
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TABLE 4 

Sources of Content Bias for Items with Three or More Statistical
Indicators of Differential Performance, by Subtest

TABLE 5



Revised Summary of Performance by Subtest Group, Deleting
Items with Three or More Statistical Indicators

Subtest                           Level C                             Level 2
                          Vocabulary     Passage Comp.      Vocabulary     Passage Comp.
Group                   English  Spanish  English  Spanish  English  Spanish  English  Spanish

[Adjusted mean]           .6804    .6606    .6216    .6061    .5318    .5322    .5431    .4067
[Gain from initial]       -0234    .0394   -.0038    .0137    .0219    .1020    .1230    .0969
Adjusted s.d.             .1298    .1502    .0936    .1039    .1418    .1476    .0206    .0235
[Adjusted maximum]        .8571    .8542    .7356    .7128    .8568    .7662    .7507    .5707
Adjusted minimum          .4104    .3004    .4826    .4343    .3344    .3005    .2366    .1272

Adjusted n of items: 24, 21, [illegible], 21, [illegible] (column positions
not recoverable). Bracketed row labels are inferred from the text.



Figure Captions 

Figures 1A and 1B: 1A) Perfect Guttman scale for a hypothetical
ten-item test scored right (1) and wrong (0). Persons and items are
uniformly ordered, by total correct score and level of difficulty,
respectively. 1B) Perfect Guttman scale, showing uniform ordering with
lower overall performance.

Figures 2A and 2B: 2A) Hypothetical score matrix for a ten-item
test sorted by respondents on descending total score and by items on
ascending level of difficulty. S- and P-curves reflect cumulative ogives
of performance, and lead to an appraisal of the characteristic
performance of the group. 2B) Hypothetical score matrix for the same test
with a different group, again sorted by respondents and items.

Figure 3: Hypothetical patterns of response to two items by ten
persons, showing a poorly-fitted and a better-fitted item.
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